WIP: feat: added logic to handle cert tnf cert rotation #1493
Walkthrough
Adds a restart-etcd command to the setup runner; introduces an operator Secret event handler to detect etcd cert changes and trigger restart jobs; exposes a PCS API to restart etcd; extends job tooling with a restart-etcd type and timeout; and adds a runner that invokes the PCS restart.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: eggfoobar
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
cmd/tnf-setup-runner/main.go (3 hunks)
pkg/tnf/operator/starter.go (4 hunks)
pkg/tnf/pkg/pcs/etcd.go (1 hunks)
pkg/tnf/pkg/tools/jobs.go (3 hunks)
pkg/tnf/restart-etcd/runner.go (1 hunks)
for _, node := range nodeList {
	runJobController(ctx, tools.JobTypeRestartEtcd, &node.Name, controllerContext, operatorClient, client, kubeInformersForNamespaces)
}
Avoid spawning restart job controllers on every secret event
runJobController spins up a long-lived controller (go ...Run(ctx, 1)). Calling it inside the cert-change handler means every secret update (and even re-list events) will start another controller instance per node, leading to unbounded goroutines and duplicated informers fighting over the same tnf-restart-etcd-job-* resources. Please start these controllers once (e.g. alongside the other TNF job controllers) and let the handler only manage job lifecycle triggers.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-etcd-certrotation
@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7ace3120-a3ca-11f0-859f-c3a8764c54a7-0
added a handler func during event watch to trigger a job on both nodes to restart the podman-etcd service Signed-off-by: ehila <[email protected]>
e4b1946 to 00d5025
@eggfoobar: The following tests failed.
I'm not sure this can work, because AFAIK there is a delay between the cert Secret being updated and the actual files on the nodes being synced. I'm afraid the podman-etcd restart happens too early.
Also, I tried to create a cluster with this patch. It looks like it runs the new restart jobs too early, before the pacemaker cluster is set up, which fails cluster creation:
$ k get job,pod
NAME                                      STATUS     COMPLETIONS   DURATION   AGE
job.batch/tnf-after-setup-job-master-0    Complete   1/1           4m35s      20h
job.batch/tnf-after-setup-job-master-1    Complete   1/1           4m34s      20h
job.batch/tnf-auth-job-master-0           Complete   1/1           6s         20h
job.batch/tnf-auth-job-master-1           Complete   1/1           7s         20h
job.batch/tnf-fencing-job                 Complete   1/1           4m31s      20h
job.batch/tnf-restart-etcd-job-master-0   Failed     0/1           20h        20h
job.batch/tnf-restart-etcd-job-master-1   Failed     0/1           20h        20h
job.batch/tnf-setup-job                   Complete   1/1           4m21s      20h

NAME                                      READY   STATUS      RESTARTS   AGE
pod/etcd-guard-master-0                   1/1     Running     0          20h
pod/etcd-guard-master-1                   1/1     Running     0          20h
pod/etcd-master-0                         4/4     Running     0          20h
pod/etcd-master-1                         4/4     Running     0          20h
pod/installer-3-master-1                  0/1     Completed   0          20h
pod/installer-5-master-0                  0/1     Completed   0          20h
pod/installer-5-master-1                  0/1     Completed   0          20h
pod/installer-6-master-0                  0/1     Completed   0          20h
pod/installer-6-master-1                  0/1     Completed   0          20h
pod/installer-7-master-0                  0/1     Completed   0          20h
pod/installer-7-master-1                  0/1     Completed   0          20h
pod/revision-pruner-7-master-0            0/1     Completed   0          20h
pod/revision-pruner-7-master-1            0/1     Completed   0          20h
pod/tnf-after-setup-job-master-0-8btwt    0/1     Completed   0          20h
pod/tnf-after-setup-job-master-1-h4k9r    0/1     Completed   0          20h
pod/tnf-auth-job-master-0-fwr26           0/1     Completed   0          20h
pod/tnf-auth-job-master-1-4z72f           0/1     Completed   0          20h
pod/tnf-fencing-job-8jzgf                 0/1     Completed   0          20h
pod/tnf-restart-etcd-job-master-0-2zzcw   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-0-4rpwm   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-0-5lqzp   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-0-lld2c   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-2c6dp   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-7vx2c   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-pssr6   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-tqhks   0/1     Error       0          20h
pod/tnf-setup-job-d9prb                   0/1     Completed   0          20h
$ k logs tnf-restart-etcd-job-master-0-2zzcw
I1015 13:13:41.793135 14200 runner.go:12] Running TNF etcd restart
I1015 13:13:41.793260 14200 etcd.go:37] Checking pcs resources
I1015 13:13:41.793298 14200 exec.go:24] Executing: /usr/bin/nsenter -a -t 1 /bin/bash -c /usr/sbin/pcs resource status
I1015 13:13:43.190362 14200 exec.go:38] stdout:
I1015 13:13:43.190670 14200 exec.go:39] stderr: Error: unable to get cluster status from crm_mon
crm_mon: Connection to cluster failed: Connection refused
I1015 13:13:43.190726 14200 exec.go:41] err: exit status 1
E1015 13:13:43.190744 14200 etcd.go:41] exit status 1Failed to get pcs resource statusstdoutstderrError: unable to get cluster status from crm_mon
crm_mon: Connection to cluster failed: Connection refused
errexit status 1
F1015 13:13:43.190795 14200 main.go:126] exit status 1
PR needs rebase.